Airfares

Part A.

A simple predictive model of the target variable - "simple" meaning choose just ONE explanatory variable.

How did you choose the explanatory variable?

The correlation heatmap shows that DISTANCE has the highest correlation coefficient with FARE among all variables, so I chose DISTANCE as the explanatory variable.
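The selection step described above can be sketched in a few lines of pandas. This is only an illustration: the data below is synthetic (generated so that DISTANCE drives FARE), the numeric values are assumptions and not the real Airfares figures, and the names `corr_with_fare` and `best_variable` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real Airfares values are NOT reproduced here.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "DISTANCE": rng.uniform(100, 2500, n),
    "E_INCOME": rng.normal(27000, 3000, n),
    "PAX": rng.uniform(1000, 70000, n),
})
# Assumed relationship, for illustration only: FARE depends on DISTANCE plus noise.
df["FARE"] = 100 + 0.08 * df["DISTANCE"] + rng.normal(0, 20, n)

# Absolute correlation of each candidate variable with the target, largest first.
corr_with_fare = (
    df.corr()["FARE"].drop("FARE").abs().sort_values(ascending=False)
)
best_variable = corr_with_fare.index[0]

print(corr_with_fare)
print("Chosen explanatory variable:", best_variable)
```

Ranking by absolute value matters because a strongly negative correlation is as useful for prediction as a strongly positive one.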

Does your model under or overfit the data? How do you know?

R-squared is high on both the training and test data, so the model does not underfit. MAE and MAPE are lower on the training data than on the test data, as is typical for a fitted model, so overall this model performs well.
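The train-versus-test comparison used to answer this question might look like the sketch below. It assumes scikit-learn and uses synthetic data with a single feature standing in for DISTANCE. The `report` helper and the `gap` dictionary are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

# Synthetic single-feature data (an assumption, not the real Airfares data).
rng = np.random.default_rng(1)
X = rng.uniform(100, 2500, size=(300, 1))
y = 100 + 0.08 * X[:, 0] + rng.normal(0, 20, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)


def report(X_part, y_part):
    """Return R-squared, MAE, and MAPE for one data split."""
    pred = model.predict(X_part)
    return {
        "r2": r2_score(y_part, pred),
        "mae": mean_absolute_error(y_part, pred),
        "mape": mean_absolute_percentage_error(y_part, pred),
    }


train_scores = report(X_train, y_train)
test_scores = report(X_test, y_test)

# Test minus train: a large gap points toward overfitting,
# while poor scores on both splits point toward underfitting.
gap = {name: test_scores[name] - train_scores[name] for name in train_scores}

print("train:", train_scores)
print("test: ", test_scores)
print("gap:  ", gap)
```

The size of the gap matters more than its sign: a small difference between training and test error is normal, so it helps to report the numbers side by side.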

Part B.

Create a slightly more complicated predictive model of the target variable. In particular, add 1-3 more variables that you think have potential to improve your model.

The heatmap shows that the five variables with the highest correlation coefficients with FARE are, in descending order, DISTANCE, COUPON, E_INCOME, E_POP, and VACATION_Yes. We remove COUPON from the list of explanatory variables because it is highly correlated with DISTANCE, which can cause multicollinearity in the regression.
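One way to check the multicollinearity concern numerically is the variance inflation factor (VIF), which for standardized predictors equals the diagonal of the inverse correlation matrix. The sketch below is an illustration on synthetic data in which COUPON is deliberately constructed to track DISTANCE. The coefficients used to build it are assumptions, not estimates from the real dataset.

```python
import numpy as np

# Synthetic predictors (assumed values, for illustration only).
rng = np.random.default_rng(2)
n = 500
distance = rng.uniform(100, 2500, n)
coupon = 1.0 + 0.0003 * distance + rng.normal(0, 0.05, n)  # built to track DISTANCE
e_income = rng.normal(27000, 3000, n)

names = ["DISTANCE", "COUPON", "E_INCOME"]
X = np.column_stack([distance, coupon, e_income])

# Pairwise correlations between the predictors (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the predictor correlation matrix.
vif = dict(zip(names, np.diag(np.linalg.inv(corr))))

print("corr(DISTANCE, COUPON):", corr[0, 1])
for name, value in vif.items():
    print(name, "VIF:", value)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity, which would support dropping one variable of the correlated pair.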

Take note of any differences in model performance from 1. to 2.

R-squared is higher in the new model than in the old model, and MAE, MAPE, and SSE are lower. However, MAE and MAPE are lower on the new model's training set than on its test set, which suggests the new model is overfitting.

Do you notice any major changes in the magnitudes of your parameter estimates?

Compared with the first regression model, the coefficient on DISTANCE is lower in the new model, and VACATION_Yes has the largest coefficient.

Pick one parameter estimate and, in words, describe what it means?

The coefficient of DISTANCE means that, holding the other variables constant, each one-unit increase in DISTANCE is associated with an increase of 0.65 dollars in FARE.

Part C.

Add all potential explanatory variables to your model and any data transformations you think could be helpful. Use Ridge or Lasso regression in collaboration with Cross-Validation to arrive at a final model form. Note: your use of the methods above should result in some parameters dropping out of your model - take note of which parameters and associated variables are important to good model fit and a low degree of model variability.

The final model is the best model: it has a low MAPE, a low MAE, and a high R-squared on both the training and test data (0.782 and 0.793, respectively). The variables DISTANCE, GATE_Free, E_POP, HI, PAX, S_POP, SLOT_Free, SW_Yes, and VACATION_Yes are important to good model fit and a low degree of model variability.
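A minimal sketch of Lasso combined with cross-validation is shown below, assuming scikit-learn's `LassoCV`. It is not the exact pipeline used for the final model: the data is synthetic, `NOISE_1` and `NOISE_2` are hypothetical columns with no real effect on the target, and whether they are dropped depends on the penalty strength that cross-validation selects.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative columns and two pure-noise columns
# (assumed for illustration; not the real Airfares data).
rng = np.random.default_rng(3)
n = 400
names = ["DISTANCE", "PAX", "NOISE_1", "NOISE_2"]
X = rng.normal(size=(n, 4))
y = 100 + 40 * X[:, 0] - 15 * X[:, 1] + rng.normal(0, 10, n)

# Standardize first so the L1 penalty treats every variable on the same scale,
# then let 5-fold cross-validation choose the penalty strength alpha.
pipeline = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
pipeline.fit(X, y)
lasso = pipeline.named_steps["lassocv"]

# Variables whose coefficient was shrunk exactly to zero have dropped out.
kept = [name for name, coef in zip(names, lasso.coef_) if coef != 0]
dropped = [name for name, coef in zip(names, lasso.coef_) if coef == 0]

print("alpha chosen by CV:", lasso.alpha_)
print("kept:", kept)
print("dropped:", dropped)
```

Standardizing before the penalty is a deliberate choice here: without it, variables measured in large units would be penalized less than variables measured in small units, which would distort which parameters drop out.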